Human language, the most powerful communication system in history, is closely associated with
cognition. Written text is one of the fundamental manifestations of language, and the study of 
its universal regularities can give clues about how our brains process information and how we, 
as a society, organize and share it. Still, only classical patterns such as Zipf's law have been 
explored in depth. In contrast, other basic properties like the existence of bursts of rare words in 
specific documents, the topical organization of collections, or the sublinear growth of vocabulary size 
with the length of a document, have only been studied one by one and mainly applying heuristic 
methodologies rather than basic principles and general mechanisms. As a consequence, there is 
a lack of understanding of linguistic processes as complex emergent phenomena. Beyond Zipf's 
law for word frequencies, here we focus on Heaps' law, burstiness, and the topicality of document 
collections, which encode correlations within and across documents absent in random null models. 
We introduce and validate a generative model that explains the simultaneous emergence of all these 
patterns from simple rules. As a result, we find a connection between the bursty nature of rare 
words and the topical organization of texts and identify dynamic word ranking and memory across 
documents as key mechanisms explaining the nontrivial organization of written text. Our research
can have broad implications and practical applications in computer science, cognitive science, and 
linguistics. 



Introduction 

Even in the era of the information technology revolution, language remains the most powerful
and sophisticated communication system in the history of civilization [1]. Its understanding
requires an interdisciplinary approach and has broad conceptual and practical implications.
It involves a number of disciplines, from computer science, where natural language processing
seeks to model language computationally, to cognitive science, which tries to understand our
intelligence, with linguistics as one of its key contributing disciplines [5].

After speech, written text is probably the most fun- 
damental manifestation of human language. Nowadays, 
electronic and information technology media offer the op- 
portunity of recording and accessing easily huge amounts 
of documents that can be analyzed in quest for some of 
the signatures of human communication. As a first step, 
statistical patterns in written text can be detected as a 
trace of the mental processes we use in communication. 
It has been realized that various universal regularities 
characterize text from different domains and languages. 
The best-known is Zipf's law on the distribution of word 
frequencies [5J [B] , according to which the frequency of 
terms in a collection decreases inversely to the rank of the 
terms. Zipf's law has been found to apply to collections of 
written documents in virtually all languages. Other notable universal regularities of text
are Heaps' law [9, 10], according to which vocabulary size grows slowly with document size,
i.e. as a sublinear function of the number of words; and the bursty nature of words
[11, 12, 13], making a word more likely to reappear in a document if it has already appeared,
compared to its overall frequency across the collection.

Understanding the structure of written text is key to a broad range of critical applications
such as Web search [13, 14, 15] (and the booming business of online advertising), literature
mining [16, 17], topic detection [18, 19], and security [20, 21, 22]. Thus, it is not
surprising that researchers in linguistics, information and cognitive science, machine
learning, and complex systems are coming together to model how universal text properties
emerge. Different models have been proposed that are able to predict each of the universal
properties outlined above. However, no single model of text generation explains all of them
together. Furthermore, no model has been used to interpret or predict the empirical
distributions of text similarity between documents in a collection [23, 24].

In this paper we present a model that generates collec- 
tions of documents consistently with all of the above sta- 
tistical features of textual corpora, and validate it against 
large and diverse Web datasets. We go beyond the global 
level of Zipf's law, which we take for granted, and focus 
on general correlation signatures within and across doc- 
uments. These correlation patterns, manifesting them- 
selves as burstiness and similarity, are destroyed when the 
words in a collection are reshuffled, even while the global 
word frequencies are preserved. Therefore the correla- 
tions are not simply explained by Zipf's law, and are di- 
rectly related to the global organization and topicality of 
the corpora. The aim of our model is not to reproduce the 
microscopic patterns of occurrence of individual words, 
but rather to provide a stylized generative mechanism 
to interpret their emergence in statistical terms.

FIG. 1: Regularities in textual data as observed in our three empirical datasets. (a) Zipf's law: word counts are globally
distributed according to a power law $P(f_g) \sim f_g^{-2}$. (b) Heaps' law: as the number of words $n$ in a document grows, the average
vocabulary size (i.e. the number of distinct words) $w(n)$ grows sublinearly with $n$. (c) Burstiness: fraction of documents $P(f_d)$
containing $f_d$ occurrences of common or rare terms. For each dataset, we label as "common" those terms that account for 71%
of total word occurrences in the collection, while rare terms account for 8%. (d) Similarity: distribution of cosine similarity $s$
across all pairs of documents, each represented as a term frequency vector. Also shown are $w(n)$, the distributions of $f_d$, and
the distribution of $s$ according to the Zipf null model (see text) corresponding to the IS dataset.

Consequently, our main assumption is a global distribution of
word probabilities; we do not need to fit a large num- 
ber of parameters to the data, in contrast to parametric 
models proposed to describe the bursty nature or topicality
of text [25, 26, 27]. In our model, each document
is derived by a local ranking of dynamically reordered 
words, and different documents are related by sharing 
subsets of these rankings that represent emerging topics. 
Our analysis shows that the statistical structure of text 
collections, including their level of topicality, can be de- 
rived from such a simple ranking mechanism. Ranking is 
an alternative to preferential attachment for explaining 
scale invariance [28] and has been used to explain the
emergent topology of complex information, technological,
and social networks [29]. The present results suggest
that it may also shed light on cognitive processes such as 
text generation and the collective mechanisms we use to 
organize and store information. 



Empirical observations 

We have selected three very diverse public datasets, 
from topically focused to broad coverage, to illustrate 
the statistical regularities of text and validate our model. 
The first corpus is the Industry Sector database (IS), a 
collection of corporate Web pages organized into cate- 
gories or sectors. The second dataset is a sample of the 
Open Directory (ODP), a collection of Web pages classi- 
fied into a large hierarchical taxonomy by volunteer edi- 
tors. The third corpus is a random sample of topic pages 
from the English Wikipedia (Wiki), a popular collaborative
encyclopedia that also comprises millions of
online entries. (See Appendix A for details.)

We measured the statistical regularities mentioned 
above in our datasets and the empirical results are shown 
in Fig. 1. The distributions of document length for all three collections are very well
approximated by a lognormal, with different first and second moment parameters (see Table 1
and Fig. 4 in Appendix A). According to Zipf's law [SI EJ [HI EH], the global frequency $f_g$
of terms in a collection decreases roughly inversely to their rank $r$: $f_g \sim 1/r$ or, in
other words, the distribution of the frequency $f_g$ is well approximated by a power law
$P(f_g) \sim f_g^{-\alpha}$ with exponent $\alpha \approx 2$. This regularity has been found
to apply to collections of written documents in virtually all languages, including the
datasets used here (Fig. 1a). Heaps' law [9, 10] describes the sublinear growth of vocabulary
size (number of unique words) $w$ as a function of the size of a document (number of words)
$n$ (Fig. 1b). This regularity has also been observed in different languages, and the
behavior has been interpreted as a power law $w(n) \sim n^{\beta}$ with $\beta < 1$, although
the exponent $\beta$, between 0.4 and 0.6, is language-dependent [31].
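
As a minimal illustration of how these two regularities can be measured, the following Python sketch computes the rank-ordered word counts (the input to a Zipf plot) and the vocabulary growth curve $w(n)$ (the input to a Heaps plot) for a tokenized text. The function names and the toy sentence are assumptions made for this example; they are not part of the analysis pipeline used for the datasets.

```python
from collections import Counter


def rank_frequency(tokens):
    """Return word counts sorted from most to least frequent (rank 1 first)."""
    return sorted(Counter(tokens).values(), reverse=True)


def vocabulary_growth(tokens):
    """Return w(n), the number of distinct words among the first n tokens."""
    seen = set()
    growth = []
    for token in tokens:
        seen.add(token)
        growth.append(len(seen))
    return growth


# Toy input, assumed purely for illustration.
tokens = "the cat sat on the mat and the dog sat on the cat".split()
freqs = rank_frequency(tokens)   # counts by rank, for a log-log Zipf plot
w = vocabulary_growth(tokens)    # w(n) for n = 1..len(tokens), for a Heaps plot
```

On real corpora one would plot `freqs` against rank, and `w` against `n`, on logarithmic axes and estimate the exponents from the slopes.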

Burstiness is the tendency of some words to occur clustered together in individual documents,
so that a term is more likely to reappear in a document where it has appeared before
[11, 12, 13]. This property is more evident among rare words, which are more likely to be
topical. Following Elkan [37], the bursty nature of words can be illustrated by dividing
words into classes according to their global frequency (e.g., common vs. rare). For words in
each class, we plot in Fig. 1c the fraction $P(f_d)$ of documents in which these words occur
with frequency $f_d$. We compare the distribution $P(f_d)$ of common and rare terms with
those predicted by the null independence hypothesis, which generates documents whose length
is drawn from the same lognormal distribution as the empirical data (see Table 1 and Fig. 4
in Appendix A) by drawing words independently at random from the global Zipf frequency
distribution (Fig. 1a). Compared to this Zipf null model, rare terms are much more likely to
cluster in specific documents rather than appear evenly distributed across the collection,
so ordering principles beyond those responsible for Zipf's law must be at play.
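
The null model and the burstiness measurement can be sketched as follows. This is a hedged illustration, not the code used for Fig. 1c: the vocabulary size, the lognormal parameters, the choice of the "rare" class, and all function names are assumptions, and $P(f_d)$ is estimated here over all (document, word) pairs for the words of a class, which is one possible reading of the definition above.

```python
import random
from collections import Counter


def zipf_weights(vocab_size, exponent=1.0):
    """Unnormalized Zipf weights r**(-exponent) for ranks r = 1..vocab_size."""
    return [r ** -exponent for r in range(1, vocab_size + 1)]


def null_zipf_corpus(n_docs, weights, mu, sigma, rng):
    """Generate documents by independent draws from one global distribution.

    Lengths are lognormal (mu and sigma are the mean and standard deviation
    of the log-length); words are integer ids equal to rank - 1.
    """
    word_ids = range(len(weights))
    corpus = []
    for _ in range(n_docs):
        length = max(1, round(rng.lognormvariate(mu, sigma)))
        corpus.append(rng.choices(word_ids, weights=weights, k=length))
    return corpus


def within_document_distribution(corpus, word_class):
    """Estimate P(f_d) over all (document, word) pairs with word in word_class."""
    histogram = Counter()
    for doc in corpus:
        counts = Counter(doc)
        for word in word_class:
            histogram[counts[word]] += 1  # a Counter returns 0 for absent words
    total = sum(histogram.values())
    return {f_d: c / total for f_d, c in sorted(histogram.items())}


# Illustrative parameters only; they are not taken from Table 1.
rng = random.Random(0)
corpus = null_zipf_corpus(200, zipf_weights(1000), mu=4.0, sigma=1.0, rng=rng)
p_rare = within_document_distribution(corpus, word_class=range(500, 1000))
```

Applying `within_document_distribution` to an empirical corpus and to its null counterpart, for the same word class, gives the two curves whose gap signals burstiness.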

Another signature of text collections, which is more telling about topicality, is the
distribution of lexical similarity across pairs of documents. In information retrieval and
text mining, documents are typically represented as term vectors [13, 32]. Each element of a
vector represents the weight of the corresponding term in the document. There are various
vector representations according to different weighting schemes. Here, we focus on the
simplest scheme, in which a weight is simply the frequency of the term in the document. The
similarity between two documents is given by the cosine between the two vectors:

$$ s(p,q) = \frac{\sum_t w_{tp}\, w_{tq}}{\sqrt{\sum_t w_{tp}^2 \cdot \sum_t w_{tq}^2}}, $$

where $w_{tp}$ is the weight of term $t$ in document $p$. It has been observed that for
documents sampled from the ODP, the distribution of cosine similarity based on term frequency
vectors is concentrated around zero and decays in a roughly exponential fashion for $s > 0$
[23, 24]. Figure 1d shows that different collections yield different similarity profiles;
however, they all tend to be more skewed toward small similarity values than predicted by the
Zipf model.
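
The cosine similarity between two term-frequency vectors can be computed directly from token lists, as in this short sketch. The function name and the convention of returning zero for an empty document are assumptions made for the example.

```python
import math
from collections import Counter


def cosine_similarity(doc_p, doc_q):
    """Cosine between the term-frequency vectors of two tokenized documents."""
    w_p, w_q = Counter(doc_p), Counter(doc_q)
    # Only terms present in both documents contribute to the dot product.
    dot = sum(w_p[t] * w_q[t] for t in w_p.keys() & w_q.keys())
    norm = math.sqrt(sum(v * v for v in w_p.values()) *
                     sum(v * v for v in w_q.values()))
    return dot / norm if norm else 0.0  # assumed convention for empty documents


s_example = cosine_similarity(["a", "a", "b"], ["a", "b", "b"])
```

Evaluating this function over all pairs of documents in a collection and binning the values yields a similarity distribution of the kind shown in Fig. 1d.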



Modeling how these properties emerge from simple rules is central to an understanding of
human language and related cognitive processes. Our understanding, however, is far from
definitive. First, the empirical observations are open to different interpretations. As an
example, much has been written about the debate between Simon and Mandelbrot around different
interpretations of Zipf's law (see www.nslij-genetics.org/wli/zipf for a historical review of
the debate). Second, and perhaps more importantly, no single model of text generation
explains all of the above observations simultaneously. Third, the models at hand are usually
based on heuristic methods rather than on basic principles and general mechanisms that could
explain linguistic processes as emergent phenomena.

In the remainder of this paper, we focus on burstiness and similarity distributions.
Regarding similarity, little attention has been given to its empirical distribution and, to
the best of our knowledge, no model has been put forth to explain its profile. Regarding text
burstiness, on the other hand, several models have been proposed, including the two-Poisson
model [11], the Poisson zero-inflated mixture model [33], Katz' k-mixture model [12], and a
gap-based variation of Bayes model [53]. Another line of generative models extends the simple
multinomial family with increasingly complex views of topics. Examples include probabilistic
latent semantic indexing [35], latent Dirichlet allocation (LDA) [25], and Pachinko
allocation [33]. These models assume a set of topics, each typically described by a
multinomial distribution over words. Each document is then generated from some mixture of
these topics. In LDA, for example, the parameters of the mixture are drawn from a Dirichlet
distribution, independently for each document. Each word in a document is generated by
drawing a topic from the mixture and then the term from its corresponding word distribution.
A variety of techniques have been developed to estimate from data the parameters that
characterize the many distributions involved in the generative process [31, 25, 37]. Although
the above models were mainly developed for subject classification, they have also been used
to investigate burstiness since bursty words can characterize the topic of a document
[27, 35].
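
For concreteness, the LDA generative step described above (draw a per-document topic mixture from a Dirichlet, then draw a topic and a term for each word) can be sketched in a few lines. This is a schematic illustration under assumed inputs, not an implementation from the cited works: the topic-word table, the Dirichlet parameters, and the function names are all hypothetical, and no parameter estimation is performed.

```python
import random


def dirichlet(alpha, rng):
    """Sample from a Dirichlet distribution by normalizing gamma variates."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def lda_document(length, topic_word, alpha, rng):
    """Generate one document of word ids from given topic-word distributions."""
    theta = dirichlet(alpha, rng)              # per-document topic mixture
    topics = range(len(topic_word))
    vocabulary = range(len(topic_word[0]))
    document = []
    for _ in range(length):
        k = rng.choices(topics, weights=theta)[0]                      # pick a topic
        document.append(rng.choices(vocabulary, weights=topic_word[k])[0])  # pick a term
    return document


# Two assumed topics over a four-word vocabulary, with disjoint supports.
topic_word = [[0.5, 0.5, 0.0, 0.0],
              [0.0, 0.0, 0.5, 0.5]]
doc = lda_document(50, topic_word, alpha=[0.5, 0.5], rng=random.Random(3))
```

Even this toy version needs one distribution over the vocabulary per topic, which is the source of the large parameter counts discussed next.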

The very large numbers of free parameters associ- 
ated with individual terms, topics, and/or their mix- 
tures grant the above models great descriptive power. 
However, their cognitive plausibility is problematic. Our 
aim here is instead to produce a simpler, more plausi- 
ble mechanism compatible with the high-level statistical 
regularities associated with both burstiness and similarity 
distributions, without regard for explicit topic modeling. 



Model and results 

Two basic mechanisms, reordering and memory, can 
explain burstiness and similarity consistently with Zipf's 
law. We show this by proposing a generative model
that incorporates these processes to produce collections 
of documents characterized by the observed statistical 
regularities. Each document is derived by a local rank- 
ing of words that reorganizes according to the changing 
word frequencies as the document grows, and different 
documents are related by sharing subsets of these rank- 
ings that represent emerging topics. With just the main 
assumptions of the global distribution of word probabil- 
ities and document sizes and a single tunable parameter 
measuring the topicality of the collection, we are able to 
generate synthetic corpora that re-create faithfully the 
features of our Web datasets. Next, we describe two vari- 
ations of the model, one without memory and the second 
with a memory mechanism that captures topicality. 



Dynamic ranking by frequency 

In our model, $D$ documents are generated by drawing
word instances repeatedly with replacement from a vo- 
cabulary of V words. The document lengths in number 
of words are drawn from a lognormal distribution. The 
parameters D, V and the lognormal mean and variance 
are derived empirically from each dataset (see Table 1 
in Appendix A). We further assume that at any step of 
the generation process, word probabilities follow a Zipf 
distribution $P[r(t)] \propto r(t)^{-1}$, where $r(t)$ is the rank of
term $t$. However, rather than keeping a fixed ranking, we
imagine that words are sorted dynamically during the 
generation of each document according to the number of 
times they have already occurred. Words and ranks are 
thus decoupled: at different times, a word can have differ- 
ent ranks and a position in the ranking can be occupied 
by different words. The idea is that as the topicality of 
a document emerges through its content, topical words 
will be more likely to reoccur within the same document. 
This idea is incorporated into the model as a frequency 
bias favoring words that occur early in the document. 

In the first version of the model, each document is produced independently of the others.
Before each new document is generated, words are sorted according to an initial global
ranking, which remains fixed for all documents. This ranking $r_0$ is also used to break ties
during the generation of documents, among words with the same occurrence counts. The
algorithm corresponding to this dynamic ranking model is illustrated in Fig. 2 and detailed
in Appendix C.
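
The memoryless version of the model can be sketched as follows. This is our reading of the verbal description above, not a transcription of the algorithm in Appendix C: we assume a rank distribution $P(r) \propto r^{-1}$ truncated at the vocabulary size, identify each word with its initial global rank, and use hypothetical function names throughout.

```python
import bisect
import random
from itertools import accumulate


def zipf_cdf(vocab_size, exponent=1.0):
    """Cumulative distribution over ranks 1..vocab_size, P(r) ~ r**(-exponent)."""
    weights = [r ** -exponent for r in range(1, vocab_size + 1)]
    total = sum(weights)
    return [c / total for c in accumulate(weights)]


def generate_document(length, rank_cdf, rng):
    """Generate one document; a word id equals its initial global rank minus 1."""
    vocab_size = len(rank_cdf)
    counts = [0] * vocab_size
    ranking = list(range(vocab_size))  # position -> word id (global ranking r_0)
    document = []
    for _ in range(length):
        # Draw a rank position from the fixed Zipf distribution over ranks.
        pos = min(bisect.bisect_left(rank_cdf, rng.random()), vocab_size - 1)
        word = ranking[pos]
        document.append(word)
        counts[word] += 1
        # Re-rank: move the word above every word with a smaller count, or with
        # an equal count but a worse (larger) initial rank.
        while pos > 0:
            above = ranking[pos - 1]
            if (counts[above], -above) > (counts[word], -word):
                break
            ranking[pos - 1], ranking[pos] = word, above
            pos -= 1
    return document, counts


rng = random.Random(42)
doc, counts = generate_document(300, zipf_cdf(5000), rng)
```

Because the probability is attached to the rank position rather than to the word, a rare word that happens to be drawn early climbs the ranking and becomes much more likely to be drawn again, which is the source of burstiness in this sketch.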

When a sufficiently large number of documents is generated, the measured frequency of a word
$t$ over the entire corpus approaches the Zipf distribution $P(t) \sim [r_0(t)]^{-1}$,
ensuring the self-consistency of the model. We numerically simulated the dynamic ranking
model for each dataset. A direct comparison with the empirical burstiness curves shown in
Fig. 1c can be found in Fig. 3a. The excellent agreement suggests that the dynamic ranking
process is sufficient for producing the right amount of correlations inside documents needed
to realistically account for the burstiness effect.

Heaps' law can be derived analytically from our model (see Appendix B). Assuming a Zipf's law
with a tail of the form $P(r) \sim r^{-\gamma}$ where $\gamma > 1$, the solution is
$w(n) \sim n^{1/\gamma}$ and we recover Heaps' sublinear growth with $\beta \approx 1/\gamma$
for large $n$. According to the Yule-Simon model [39], which interprets Zipf's law through a
preferential attachment process, the rank distribution should have a tail with exponent
$\gamma > 1$. This is confirmed empirically in many English collections; for example, our ODP
and Wikipedia datasets yield Zipfian tails with $\gamma$ between 3/2 and 2. Our model
predicts that in these cases Heaps' growth should be well approximated by a power law with
exponent $\beta$ between 1/2 and 2/3, closely matching those reported for the English
language [31]. Simulations using the empirically derived $P(r)$ for each dataset display
growth trends for large $n$ that are in good agreement with the empirical behavior (Fig. 3b).
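
The scaling $w(n) \sim n^{1/\gamma}$ can be checked in a few lines. The sketch below starts from the continuous-limit growth equation of Appendix B, in which the rate of vocabulary growth equals the total probability of ranks above $w$, and assumes a pure power-law tail $P(r) = C\,r^{-\gamma}$ extending to infinity; the constant $C$ and the infinite upper limit are simplifying assumptions introduced here.

```latex
\begin{align}
\frac{dw}{dn} &= \int_{w}^{\infty} C\, r^{-\gamma}\, dr
               = \frac{C}{\gamma - 1}\, w^{\,1-\gamma}, \qquad \gamma > 1, \\
w^{\gamma - 1}\, dw &= \frac{C}{\gamma - 1}\, dn
  \;\;\Longrightarrow\;\;
  \frac{w^{\gamma}}{\gamma} = \frac{C}{\gamma - 1}\, n + \text{const}, \\
w(n) &\sim n^{1/\gamma} \quad (n \to \infty),
  \qquad \beta = \frac{1}{\gamma}.
\end{align}
```

With $\gamma = 3/2$ this gives $\beta = 2/3$, and with $\gamma = 2$ it gives $\beta = 1/2$, which are the two ends of the range quoted above.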



Topicality and similarity 

The agreement between empirical data and simulations 
of the model with respect to the similarity distributions 
gets worse for those datasets that are more topically fo- 
cused. A new mechanism is needed to account for topical 
correlations between documents. 

The model in the previous section generates collec- 
tions of independent text documents, with specific but 
uncorrelated topics captured by the bursty terms. For 
each new document, the rank of each word t is initialized 
to its original value ro(t) so that each document has no 
bias toward any particular topic. The synthetic corpora 
which result display broad coverage. However, real cor- 
pora may cover more or less specific topics. The stronger 
the semantic relationship between documents, the higher 
the likelihood they share common words. Such collection 
topicality needs to be taken into account to accurately 
reproduce the distribution of text similarity between doc- 
uments. 

To incorporate topical correlations into our model, we introduce a memory effect connecting
word frequencies across different documents. Generative models with memory have already been
proposed to explain Heaps' law [10]. In our algorithm (see Fig. 2 and Appendix C) we replace
the initialization step so that a portion of the initial ranking of the terms in each
document is inherited from the previously generated document. In particular, the counts of
the $r^*$ top-ranked words are preserved while all the others are reset to zero. The rank
$r^*$ is drawn from an exponential distribution $P(r^*) = z(1-z)^{r^*-1}$, where $z$ is a
probability parameter that models the lexical diversity of the collection and $r^*$ has
expected value $1/z$, which can be interpreted as the collection's shared topicality.
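
The modified initialization step can be sketched as below. It is an illustration under stated assumptions rather than the algorithm of Appendix C: $r^*$ is sampled by inverse transform from the distribution given above (a geometric law on the positive integers), ties are broken by word id as a stand-in for the global rank, and the function names are hypothetical. Under this literal reading, $z = 1$ always gives $r^* = 1$, so a single top word keeps its count.

```python
import math
import random


def sample_r_star(z, rng):
    """Draw r* >= 1 with P(r*) = z * (1 - z)**(r* - 1); z <= 0 means no reset."""
    if z <= 0.0:
        return math.inf
    if z >= 1.0:
        return 1
    u = 1.0 - rng.random()  # uniform on (0, 1]
    return int(math.log(u) / math.log(1.0 - z)) + 1


def inherit_counts(previous_counts, z, rng):
    """Keep the counts of the r* top-ranked words; reset all others to zero."""
    vocab_size = len(previous_counts)
    r_star = min(sample_r_star(z, rng), vocab_size)
    # Rank by count (descending); ties broken by word id (assumed global rank).
    ranking = sorted(range(vocab_size), key=lambda w: (-previous_counts[w], w))
    kept = set(ranking[:r_star])
    return [c if w in kept else 0 for w, c in enumerate(previous_counts)]


rng = random.Random(7)
no_memory = inherit_counts([5, 0, 3, 1], z=1.0, rng=rng)    # only the top word survives
full_memory = inherit_counts([5, 0, 3, 1], z=0.0, rng=rng)  # every count survives
```

The returned list would serve as the starting counts of the next document, in place of the all-zero initialization of the memoryless model.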

FIG. 2: Illustration of the dynamic ranking model. The parameter $z$ regulates the lexical diversity, or topicality, of the collection.
The extreme case $z = 0$ is equivalent to the null Zipf model, where all documents are generated using the global word rank
distribution. The opposite case $z = 1$ is the first version of the dynamic ranking model, with no memory, in which each
new document starts from the global word ranking $r_0$. Intermediate values of $z$ represent the more general version of the
dynamic ranking model, where correlations across documents are created by a partial memory of word ranks. A more detailed
algorithmic description of the model can be found in Appendix C.

This variation of the model does not interfere with the reranking mechanism described in the
previous section, so that the burstiness effect is preserved. The idea is to interpolate
between two extreme cases. The case $z = 0$, in which counts are never reset, converges to
the null Zipf model. All documents share the same general terms, modeling a collection of
unspecific documents. Here we expect a high similarity in spite of the independence among
documents, because the words in all documents are drawn from the identical Zipf distribution.
The other extreme case, $z = 1$, reduces to the original model, where all the counts are
always initialized to zero before starting a document. In this case, the bursty words are
numerous but not the same across different documents, modeling a situation in which each
document is very specific but there is no shared topic across documents. Intermediate cases
$0 < z < 1$ allow us to model correlations across documents due not only to the common
general terms, but also to topical (bursty) terms.

We simulated the dynamic ranking model with memory under the same conditions corresponding to
our datasets, but additionally fitting the parameter $z$ to match the empirical similarity
distributions. The comparisons are shown in Fig. 3c. The similarity distribution for the ODP
is best reproduced for $z = 1$, in accordance with the fact that this collection is
overwhelmingly composed of very specific documents spanning all topics. In such a situation,
the original model accurately reproduces the high diversity among document topics and there
is no need for memory. In contrast, Wikipedia topic pages use a homogeneous vocabulary due to
their strict encyclopedic style and the social consensus mechanism driving the generation of
content. This is reflected in the value $z = 0.005$, corresponding to an average of
$1/z = 200$ common words whose frequencies are correlated across successive pairs of
documents. The Industry Sector dataset provides us with an intermediate case in which pages
deal with more focused, but semantically related topics. The best fit of the similarity
distribution is obtained for $z = 0.1$.

FIG. 3: Model vs. empirical observations. (a) Comparison of burstiness curves produced by the dynamic ranking model with
those from the empirical datasets. Common and rare words are defined in Fig. 1c. (b) Comparison of Heaps' law curves
produced by the dynamic ranking model with those from the empirical datasets. Simulations of the model provide the same
predictions as numerical integration of the analytically derived equation using the empirical rank distributions (see Appendix
B). For the IS dataset we also plot the result of the Zipf null model, which produces a sublinear $w(n)$, although less pronounced
than our model. The ODP collection has short documents on average (cf. Table 1 in Appendix A), so Heaps' law is barely
observable. (c) Comparison between similarity distributions produced by the dynamic ranking model with memory and those
from the empirical datasets also shown in Fig. 1d. The parameter $z$ controlling the topical memory is fitted to the data. The
peak at $s = 0$ suggests that the most common case is always that of documents sharing very few or no common terms. The
discordance for high similarity values is due to corpus artifacts such as mirrored pages, templates, and very short (one word)
documents. The fluctuations in the curves for the ODP dataset are due to binning artifacts for short pages. Also shown is the
prediction of the topic model for the IS dataset (see text).

With the fitted values for the shared topicality parameter $z$, the agreement between model
and empirical similarity data in Fig. 3c is excellent over a broad range of similarity
values. To better illustrate the significance of this result, let us compare it with the
prediction of a simple topic model. For this purpose one must have a priori knowledge of a
set of topics to be used for generating the documents. The IS dataset lends itself to this
analysis because the pages are classified into twelve disjoint industry sectors, which can
naturally be interpreted as unmixed topics. For each topic $c$, we measured the frequency of
each term $t$ and used it as a probability $p(t|c)$ in a multinomial distribution. We
generated the documents for each topic using the actual empirical values for the number of
documents in the topic and the number of words in each document. As shown in Fig. 3c, the
resulting similarity distribution is better than that of the Zipf model (where we assume a
single global distribution); however, the prediction is not nearly as good as that of our
model.
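
The topic-model baseline just described amounts to fitting one multinomial per topic and resampling every document at its empirical length. A minimal sketch, with hypothetical function names and a made-up two-document topic, is:

```python
import random
from collections import Counter


def fit_topic_multinomial(docs):
    """Estimate p(t|c) as the relative frequency of each term within one topic."""
    counts = Counter(term for doc in docs for term in doc)
    total = sum(counts.values())
    terms = sorted(counts)
    return terms, [counts[t] / total for t in terms]


def regenerate_topic(docs, rng):
    """Resample every document of a topic, keeping the number and lengths of documents."""
    terms, probs = fit_topic_multinomial(docs)
    return [rng.choices(terms, weights=probs, k=len(doc)) for doc in docs]


# Toy topic, assumed for illustration only.
docs_topic = [["oil", "gas", "oil"], ["gas", "pipeline"]]
synthetic = regenerate_topic(docs_topic, random.Random(0))
```

Repeating this for each of the topics and pooling the synthetic documents gives the collection on which the baseline similarity distribution would be measured. Note that `fit_topic_multinomial` stores one probability per term and per topic.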

Our model only requires a single free parameter z plus 
the global (Zipfian) distribution of word probabilities, 
which determines the initial ranking. Conversely, for the 
topic model we must have (or fit) the frequency distribution
$p(t|c)$ over all terms for each topic, which implies
an extraordinary increase in the number of free param- 
eters since, apart from potential differences in the func- 
tional forms, each distribution would rank the terms in 
a different order. 

Aside from complexity issues, the ability to recover 
similarities suggests that the dynamic ranking model, 
though not as well informed as the topic model on the
distributions of the specific topics, better captures word 
correlations. Topics emerge as a consequence of the cor- 
relations between bursty terms across documents as de- 
termined by z, but it is not necessary to predefine explic- 
itly the number of topics or their distributions as other 
models require. 



Conclusion 

Our results show that key regularities of written text beyond Zipf's law, namely burstiness,
topicality and their interrelation, can be accounted for on the basis of two simple
mechanisms, namely frequency ranking with dynamic reordering and memory across documents, and
can be modeled with an essentially parameter-free algorithm. The rank-based approach is in
line with other recent models in which ranking has been used to explain the emergent topology
of complex information, technological, and social networks [29]. It is not the first time
that a generative model for text has walked parallel paths with models of network growth. A
remarkable example is the Yule-Simon model of text generation [39], which was later
rediscovered in the context of citation analysis [40], and has recently found broad
popularity in the complex networks literature [41].

Our approach applies to datasets where the temporal 
sequence of documents is not important, but burstiness 
has also been studied in contexts where time is a critical 
component [13, 42], and even in human language evolution
[43]. Further investigations in relation to topicality
could attempt to explicitly demonstrate the role of the 
topicality correlation parameter by looking at the hier- 
archical structure of content classifications. Subsets of 
increasingly specific topics of the whole collection could 
be extracted to study how the parameter z changes and 
how it is related to external categorizations. The pro- 
posed model can also be used to study the coevolution 
of content and citation structure in the scientific litera- 
ture, social media such as the Wikipedia, and the Web 
at large US EH 021 05].

From a broader perspective, it seems natural that mod- 
els of text generation should be based on similar cog- 
nitive mechanisms as models of human text processing 
since text production is a translation of semantic con- 
cepts in the brain into external lexical representations. 
Indeed, our model's connection between frequency rank- 
ing and burstiness of words provides a way to relate two 
key mechanisms adopted in modeling how humans process
the lexicon: rank frequency [46] and context diversity [47].
The latter, measured by the number of documents
that contain a word, is related to burstiness since
given a term's overall collection frequency, higher bursti- 
ness implies lower context diversity. While tracking fre- 
quencies is a significant cognitive burden, our model sug- 
gests that simply recognizing that a term occurs more 
often than another in the first few lines of a document
would suffice for detecting bursty words from their ranking
and consequently the topic of the text.

In summary, a picture of how language structure and 
topicality emerge in written text as complex phenomena 
can shed light on the collective cognitive processes we
use to organize and store information, and find broad 
practical applications, for instance, in topic detection, 
literature analysis, and Web mining. 
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APPENDIX A: WEB DATASETS 

We use three different datasets. The Industry Sec- 
tor database (IS) is a collection of almost 10,000 corpo- 
rate Web pages organized into 12 categories or sectors. 
The second dataset is a sample of the Open Directory 
( |dmoz . org] ODP), a collection of Web pages classified 
into a large hierarchical taxonomy by volunteer editors. 
While the full ODP includes millions of pages, our col- 
lection comprises of approximately 150,000 pages, sam- 
pled uniformly from all top-level categories and crawled 
from the Web. The third corpus is a random sam- 
ple of 100,000 topic pages from the English Wikipedia 
(en.wikipedia.org, Wiki), a popular collaborative en- 
cyclopedia which also is comprised of millions of online 
entries. 

These English text collections are derived from public 
data and are publicly available (IS dataset is available at 
www . cs . umass . edu/~mccallum/ code-data . html , ODP 
and Wikipedia corpora available upon request); have 
been used in several previous studies, allowing a cross 
check of our results; and are large enough for our 
purposes without being computationally unmanageable. 
The datasets are however very diverse in a number of 
ways. The IS corpus is relatively small and topically fo- 
cused, while ODP and Wikipedia are larger and have 
broader coverage, as reflected in their vocabulary sizes. 
IS documents represent corporate content, while many 
Web pages in the ODP collection are individually au- 
thored. Wikipedia topics are collaboratively edited and 
thus represent the consensus of a community. In spite of 
such differences, the distributions of document length for 
all three collections are very well approximated by log- 
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10 10 10 10 
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FIG. 4: Distributions of document length for all three collec- 
tions. Each is very well approximated by a log-normal, with 
different first and second moment parameters (see Table III. 



therefore $1 - F(w)$. Multiplying both sides of Eq. B1 by
$w + 1$ and summing over $w$ leads to a relation between the
expected values $E[w]$ of the number of different words for
document sizes $n + 1$ and $n$:

$$E[w(n+1)] = E[w(n)] + E[1 - F(w(n))] \tag{B2}$$

where the second term on the r.h.s. states that the probability
of observing a new word (when $w$ different words are
already present in the document) is the cumulative probability
of words with frequency ranking larger than $w$.
Neglecting fluctuations and taking the continuous limit,
Eq. B2 leads to

$$\frac{dw(n)}{dn} = \int_{w(n)}^{\infty} P(r)\,dr. \tag{B3}$$

Eq. B3 can be integrated numerically using the actual
$P(r)$ from the data. Alternatively, Eq. B3 can be solved
analytically for special cases (see main text).
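As an illustration of the numerical route, the following minimal Python sketch iterates the mean-field form of Eq. B2, $w(n+1) = w(n) + 1 - F(w(n))$, which amounts to a unit-step Euler integration of Eq. B3 with the discrete cumulative $F(w)$ (linearly interpolated between integer ranks) standing in for the integral. The truncated power-law form of $P(r)$, the exponent `alpha`, and the function names are assumptions made for the example; in practice the empirical $P(r)$ would be used instead.

```python
def zipf_probabilities(vocab_size, alpha):
    """Normalized Zipf-like rank probabilities P(r), r = 1..vocab_size.

    The power-law form and the exponent alpha are assumptions of this sketch.
    """
    weights = [r ** -alpha for r in range(1, vocab_size + 1)]
    total = sum(weights)
    return [weight / total for weight in weights]


def expected_vocabulary_growth(probs, n_max):
    """Iterate w(n+1) = w(n) + 1 - F(w(n)) for n = 0..n_max-1.

    probs[r-1] is P(r). Returns the list [w(0), w(1), ..., w(n_max)].
    """
    vocab_size = len(probs)
    # cumulative[k] = F(k) = sum of P(r) for r = 1..k, with F(0) = 0.
    cumulative = [0.0]
    for p in probs:
        cumulative.append(cumulative[-1] + p)

    w = 0.0
    curve = [w]
    for _ in range(n_max):
        k = min(int(w), vocab_size)
        # Linear interpolation of F between integer ranks (continuous limit).
        next_p = probs[k] if k < vocab_size else 0.0
        f_w = cumulative[k] + (w - k) * next_p
        # Probability of a new word is 1 - F(w); clamp rounding noise at 0.
        w += max(0.0, 1.0 - f_w)
        curve.append(w)
    return curve


# Example: expected number of distinct words in documents of up to 1000 words.
curve = expected_vocabulary_growth(zipf_probabilities(10000, 1.0), 1000)
```

The resulting `curve` grows monotonically but more slowly than $n$, which is the sublinear behavior that Heaps' law describes.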



TABLE I: Statistics for the different document collections. $V$
stands for vocabulary size, $D$ for the number of documents
containing at least one word (in parentheses, the number of
empty documents in the collection), $\langle w \rangle$ for the average size
of documents in number of unique words, and $\langle n \rangle$ and $\sigma^2(n)$
for the average and variance of document size in number of
words. For each collection, the distribution of document size
is very well fitted by a lognormal with parameters $\mu$ and $\sigma^2$.



Dataset   $V$      $D$              $\langle w \rangle$   $\langle n \rangle$   $\sigma^2(n)$   $\mu$   $\sigma^2$
Wiki      588639   100000 (0)       160.44                373.86                457083          5.20    1.45
IS        47979    9556 (15)        124.26                313.46                566409          4.79    1.91
ODP       105692   107360 (32558)   8.88                  10.34                 345             1.62    1.44



normals shown in Fig. 4, with different first and second
moment parameters. Table I summarizes the main statistical
features of the three collections. Before our analysis,
all documents in each collection were parsed to extract
the text (removing HTML markup), and syntactic
variations of words were conflated using standard
stemming techniques [35].
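The log-normal description of document lengths can be checked with a short calculation. The sketch below is an illustration under stated assumptions, not the fitting procedure used for Table I: it estimates $\mu$ and $\sigma^2$ as the mean and variance of the logarithm of the document lengths (the maximum-likelihood estimates, skipping empty documents), and uses the standard log-normal moment formulas to recover the mean and variance of document size. The function names are hypothetical.

```python
import math


def fit_lognormal(lengths):
    """Estimate (mu, sigma2) as the mean and variance of log(length).

    Empty documents (length 0) are skipped, since log(0) is undefined.
    """
    logs = [math.log(n) for n in lengths if n > 0]
    mu = sum(logs) / len(logs)
    sigma2 = sum((x - mu) ** 2 for x in logs) / len(logs)
    return mu, sigma2


def lognormal_mean(mu, sigma2):
    """Mean of a log-normal distribution: exp(mu + sigma2 / 2)."""
    return math.exp(mu + sigma2 / 2.0)


def lognormal_variance(mu, sigma2):
    """Variance of a log-normal: (exp(sigma2) - 1) * exp(2 mu + sigma2)."""
    return (math.exp(sigma2) - 1.0) * math.exp(2.0 * mu + sigma2)


# Consistency check with the Wiki row of Table I (mu = 5.20, sigma2 = 1.45).
wiki_mean = lognormal_mean(5.20, 1.45)
wiki_variance = lognormal_variance(5.20, 1.45)
```

With the Wiki parameters, `wiki_mean` and `wiki_variance` come out close to the reported $\langle n \rangle = 373.86$ and $\sigma^2(n) = 457083$, as expected if the log-normal fit is accurate.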



APPENDIX B: ANALYTICAL DERIVATION OF
HEAPS' LAW WITHIN OUR MODEL

The probability $P(w, n)$ to find $w$ different words in a
document of size $n$ satisfies the following discrete master
equation:

$$P(w+1, n+1) = P(w+1, n)\,F(w+1) + P(w, n)\,(1 - F(w)) \tag{B1}$$

where $F(w) = \sum_{r=1}^{w} P(r)$, and $P(r)$ is the Zipf
probability associated with rank $r$. The tail of the Zipfian
rank distribution is critical because the words not yet
observed occupy the ranks at the bottom of the frequency
distribution ($r > w$), and their cumulative probability is



APPENDIX C: ALGORITHM 

The dynamic ranking model is implemented by the
following algorithm:

Vocabulary: $t \in \{1, \dots, V\}$

Initial ranking: $\forall t$: $r_0(t) = t$

Repeat until $D$ documents are generated:

    Initialize term counts to $\forall t$: $c(t) = 0$ (*)

    Draw $L$ from lognormal($\mu$, $\sigma^2$)

    Repeat until $L$ terms are generated:

        Sort terms to obtain new rank $r(t)$ according to $c(t)$ (break ties by $r_0$)

        Select term $t$ with probability $P(t) \propto r(t)^{-\alpha}$

        Add $t$ to current document

        $c(t) \leftarrow c(t) + 1$

    End of document

End of collection
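The pseudocode above can be sketched in Python as follows. This is a minimal illustration of the memoryless version, written for small vocabularies (it re-sorts all terms at every step). Several details are assumptions of the sketch rather than statements about the original implementation: terms are numbered from 0 instead of 1, the selection weight of rank $r$ is taken as $r^{-\alpha}$ with `alpha` as a free parameter, the drawn length is rounded to an integer of at least 1, and the function name is hypothetical.

```python
import math
import random


def generate_memoryless(vocab_size, num_docs, mu, sigma2, alpha=1.0, seed=0):
    """Generate documents (lists of term ids) with dynamic ranking, no memory."""
    rng = random.Random(seed)
    # rank_weights[r-1] is the unnormalized probability of picking rank r.
    rank_weights = [r ** -alpha for r in range(1, vocab_size + 1)]
    collection = []
    for _ in range(num_docs):
        counts = [0] * vocab_size  # (*) initialization: all counts set to zero
        # lognormvariate takes the standard deviation, hence the square root.
        length = max(1, round(rng.lognormvariate(mu, math.sqrt(sigma2))))
        ranking = list(range(vocab_size))  # term ids ordered by current rank
        document = []
        for _ in range(length):
            # Rank by decreasing count; ties broken by the initial ranking,
            # which here coincides with the term id.
            ranking.sort(key=lambda t: (-counts[t], t))
            # ranking[r-1] holds the term at rank r, so the weights align.
            term = rng.choices(ranking, weights=rank_weights, k=1)[0]
            document.append(term)
            counts[term] += 1
        collection.append(document)
    return collection


# Example: a small synthetic collection of 5 documents over 100 terms.
docs = generate_memoryless(vocab_size=100, num_docs=5, mu=3.0, sigma2=0.5)
```

Because a term that has already been used moves up in the ranking, it becomes more likely to be drawn again within the same document, which is the mechanism behind burstiness in the model.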

The document initialization step (the line marked with an
asterisk in the pseudocode above) is altered in the more
general, memory version of the model (see main text). In
particular, we set to zero the counts $c(t)$ not of all terms,
but only of terms $t$ such that $r(t) \geq r^*$. The rank
$r^*$ is drawn from an exponential (geometric) distribution
$P(r^*) = z(1 - z)^{r^* - 1}$, where $z$ is a probability
parameter that measures the lexical diversity of the collection
and $r^*$ has expected value $1/z$. In simpler terms, the counts
of the words ranked above $r^*$ are preserved while all the
others are reset to zero.

Algorithmically, terms are sorted by counts so that the
top-ranked term $t$ ($r(t) = 1$) has the highest $c(t)$. We
iterate over the ranks $r$, flipping a biased coin for each term.
As long as the coin returns false (probability $1 - z$), we
preserve $c(t(r))$. As soon as the coin returns true
(probability $z$), say for the term $t(r^*)$, we reset all the
counts for this and the following terms: $\forall r \geq r^*$: $c(t(r)) = 0$.



The special case $z = 1$ reverts to the original, memoryless
model; all counts are reset to zero and each document
restarts from the global Zipfian ranking $r_0$. The special
case $z = 0$ is equivalent to the Zipf null model, as the
term counts are never reset and thus rapidly converge to
the global Zipfian frequency distribution.
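The coin-flipping reset and its two special cases can be sketched as follows. This is a minimal illustration, assuming that counts are stored in a list indexed by term id and that the term id doubles as the initial rank used to break ties; the function name is hypothetical.

```python
import random


def reset_counts_with_memory(counts, z, rng):
    """Return the counts after the memory-aware initialization step.

    Terms are visited from the top rank down. Each visit flips a coin that
    comes up true with probability z; at the first true outcome, the counts
    of that term and of all lower-ranked terms are reset to zero.
    """
    # Term ids ordered by decreasing count, ties broken by term id.
    order = sorted(range(len(counts)), key=lambda t: (-counts[t], t))
    new_counts = list(counts)
    for position, _ in enumerate(order):
        if rng.random() < z:  # coin returns true: this rank plays the role of r*
            for term in order[position:]:
                new_counts[term] = 0
            break
    return new_counts


rng = random.Random(0)
counts = [5, 0, 3, 1]
memoryless = reset_counts_with_memory(counts, 1.0, rng)  # z = 1: full reset
null_model = reset_counts_with_memory(counts, 0.0, rng)  # z = 0: no reset
```

With `z = 1.0` the first coin flip always succeeds, so every count is cleared; with `z = 0.0` no flip ever succeeds and the counts carry over unchanged to the next document.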